This report analyzes graduate school admissions data. I explore the relationships between GRE score, GPA, and undergraduate institution rank and the admission outcome, fit several logistic regression models, and compare their predictive performance.
# Load required libraries
library(tidyverse)
library(dplyr)
library(ggplot2)
library(tidyr)
library(performance)
library(plotly)
library(reshape2)
library(magrittr)
library(vegan)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ purrr::%||%() masks base::%||%()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
##
## Attaching package: 'magrittr'
##
## The following object is masked from 'package:purrr':
##
## set_names
##
## The following object is masked from 'package:tidyr':
##
## extract
## Loading required package: permute
## Loading required package: lattice
# Data set
data <- read.csv("GradSchool_Admissions.csv")
# Display column names and summary
colnames(data)
## [1] "admit" "gre" "gpa" "rank"
summary(data)
## admit gre gpa rank
## Min. :0.0000 Min. :220.0 Min. :2.260 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:520.0 1st Qu.:3.130 1st Qu.:2.000
## Median :0.0000 Median :580.0 Median :3.395 Median :2.000
## Mean :0.3175 Mean :587.7 Mean :3.390 Mean :2.485
## 3rd Qu.:1.0000 3rd Qu.:660.0 3rd Qu.:3.670 3rd Qu.:3.000
## Max. :1.0000 Max. :800.0 Max. :4.000 Max. :4.000
# Build basic single-predictor logistic regression models
model_gre_admit <- glm(admit ~ gre, data = data, family = binomial)
model_gpa_admit <- glm(admit ~ gpa, data = data, family = binomial)
model_rank_admit <- glm(admit ~ rank, data = data, family = binomial)
summary(model_gre_admit)
summary(model_gpa_admit)
summary(model_rank_admit)
## Call:
## glm(formula = admit ~ gre, family = binomial, data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.901344 0.606038 -4.787 1.69e-06 ***
## gre 0.003582 0.000986 3.633 0.00028 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 486.06 on 398 degrees of freedom
## AIC: 490.06
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = admit ~ gpa, family = binomial, data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.3576 1.0353 -4.209 2.57e-05 ***
## gpa 1.0511 0.2989 3.517 0.000437 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 486.97 on 398 degrees of freedom
## AIC: 490.97
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = admit ~ rank, family = binomial, data = data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.6366 0.3061 2.080 0.0375 *
## rank -0.5863 0.1240 -4.728 2.26e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 475.71 on 398 degrees of freedom
## AIC: 479.71
##
## Number of Fisher Scoring iterations: 4
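Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to interpret. A quick sketch using the GRE model above (the same transformation applies to the GPA and rank models):

```r
# Odds ratios: exponentiate the log-odds coefficients
exp(coef(model_gre_admit))

# Each extra GRE point multiplies the odds of admission by exp(0.003582),
# so 100 extra points multiply the odds by roughly exp(0.3582), about 1.43
exp(0.003582 * 100)
```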
Before combining predictors, I checked the correlation between GRE and GPA.
cor(data$gre, data$gpa)
## [1] 0.3842659
cor.test(data$gre, data$gpa)
##
## Pearson's product-moment correlation
##
## data: data$gre and data$gpa
## t = 8.3036, df = 398, p-value = 1.596e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2974203 0.4648047
## sample estimates:
## cor
## 0.3842659
# Data frame for plotting predicted probabilities
plot_data <- data.frame(
gre = data$gre,
gpa = data$gpa,
rank = data$rank,
admit = data$admit,
pred_gre = predict(model_gre_admit, type = "response"),
pred_gpa = predict(model_gpa_admit, type = "response"),
pred_rank = predict(model_rank_admit, type = "response"))
# Plotting predicted probabilities for GRE, GPA, and Rank
ggplot(plot_data, aes(x = gre, y = pred_gre)) +
geom_line(color = "blue") +
geom_point(aes(y = admit), alpha = 0.3) +
labs(title = "Admission Probability vs GRE Score", x = "GRE Score", y = "Admission Probability") +
theme_minimal()
That didn't turn out how I wanted; it didn't give me much information. I decided to tweak it and compare models instead.
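One reason the plot looks uninformative is that geom_line connects the fitted values in raw data order. Predicting over an evenly spaced GRE grid gives a smooth logistic curve instead; a sketch reusing model_gre_admit from above:

```r
# Evenly spaced grid over the observed GRE range
gre_grid <- data.frame(gre = seq(min(data$gre), max(data$gre), length.out = 200))
gre_grid$prob <- predict(model_gre_admit, newdata = gre_grid, type = "response")

# Smooth fitted curve with jittered 0/1 outcomes overlaid
ggplot(gre_grid, aes(x = gre, y = prob)) +
  geom_line(color = "blue") +
  geom_jitter(data = data, aes(x = gre, y = admit), height = 0.02, alpha = 0.3) +
  labs(title = "Admission Probability vs GRE Score (grid predictions)",
       x = "GRE Score", y = "Admission Probability") +
  theme_minimal()
```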
# Models with varying complexities
model_gre_gpa_admit <- glm(admit ~ gre + gpa, data = data, family = binomial)
model_full_admit <- glm(admit ~ gre + gpa + rank, data = data, family = binomial)
# Compare model performance
comp <- compare_performance(model_gre_gpa_admit, model_full_admit) %>%
plot()
print(comp)
I liked this more, but I wanted a 3D image, and I ended up needing help on this part.
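Alongside compare_performance(), base R offers a likelihood-ratio test for nested models. Since model_gre_gpa_admit is nested in model_full_admit, this directly tests whether adding rank improves the fit; a sketch:

```r
# Likelihood-ratio (deviance) test for the nested logistic models
anova(model_gre_gpa_admit, model_full_admit, test = "Chisq")

# AIC comparison: lower is better
AIC(model_gre_gpa_admit, model_full_admit)
```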
gre_vals <- seq(min(data$gre), max(data$gre), length.out = 100)
gpa_vals <- seq(min(data$gpa), max(data$gpa), length.out = 100)
rank_levels <- unique(data$rank)
prediction_grid <- expand.grid(gre = gre_vals,
gpa = gpa_vals,
rank = rank_levels)
prediction_grid$predicted_prob <- predict(model_full_admit, newdata = prediction_grid, type = "response")
# 3D Plot using plotly
prediction_grid_rank1 <- prediction_grid[prediction_grid$rank == 1, ]
z_matrix <- acast(prediction_grid_rank1, gpa ~ gre, value.var = "predicted_prob")
plotly::plot_ly(
x = gre_vals,
y = gpa_vals,
z = z_matrix
) %>%
plotly::add_surface(colorscale = "Blues") %>%
plotly::layout(
title = list(text = "Predicted Admission Probability (GRE * GPA Interaction, Rank 1)"),
scene = list(
xaxis = list(title = "GRE"),
yaxis = list(title = "GPA"),
zaxis = list(title = "Admission Probability")
)
)
Although it's pretty cool, I was still just playing with predictions.
Now I wanted to test the model performance!
# Train/test split
set.seed(123)
train_indices <- sample(1:nrow(data), size = 0.7 * nrow(data))
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
# Fit the interaction model
model_interact <- glm(admit ~ gre * gpa + rank, data = train_data, family = binomial)
# Predictions and accuracy
test_data$predicted_prob <- predict(model_interact, newdata = test_data, type = "response")
test_data$predicted_class <- ifelse(test_data$predicted_prob >= 0.5, 1, 0)
# Confusion matrix and accuracy
table(Predicted = test_data$predicted_class, Actual = test_data$admit)
## Actual
## Predicted 0 1
## 0 78 23
## 1 11 8
mean(test_data$predicted_class == test_data$admit)
## [1] 0.7166667
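The 0.5 cutoff is arbitrary; a quick base-R sweep shows how accuracy changes with the classification threshold (assumes test_data from the split above):

```r
# Accuracy at a range of probability thresholds
thresholds <- seq(0.2, 0.8, by = 0.05)
acc <- sapply(thresholds, function(t) {
  pred <- as.integer(test_data$predicted_prob >= t)
  mean(pred == test_data$admit)
})
data.frame(threshold = thresholds, accuracy = acc)
```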
I explored the data and the correlations between variables: GRE and GPA have a weak-to-moderate positive correlation (r ≈ 0.38). I fit logistic regression models of admission, built up to more complex models, and compared their performance; the interaction model performed best. I visualized the fitted probabilities, then split the data into training and test sets, trained the interaction model, and evaluated its predictions against the held-out data.
Finally: my final model correctly predicted admission decisions about 71.7% of the time. One caveat: the test set is imbalanced (89 of the 120 applicants were rejected), so a trivial model that always predicts rejection would score about 74%, meaning accuracy alone overstates how informative the model is.
However: roughly 28.3% of cases were misclassified. This could be due to noise in the data set, unmeasured factors, or inherent unpredictability.
Thanks!
# Select only numeric predictors (you can scale them if needed)
data_numeric <- data %>%
select(gre, gpa, rank)
# NMDS ordination (Euclidean distance here, since all three predictors are numeric)
nmds_result <- metaMDS(data_numeric, distance = "euclidean", k = 2, trymax = 100)
## Square root transformation
## Wisconsin double standardization
## Run 0 stress 2.149938e-18
## Run 1 stress 0.0001522356
## ... Procrustes: rmse 4.181156e-05 max resid 0.0003309861
## ... Similar to previous best
## Run 2 stress 0.0001165621
## ... Procrustes: rmse 3.056482e-05 max resid 0.0003244857
## ... Similar to previous best
## Run 3 stress 0.04298563
## Run 4 stress 6.841302e-05
## ... Procrustes: rmse 1.507768e-05 max resid 0.0001428006
## ... Similar to previous best
## Run 5 stress 0.02643435
## Run 6 stress 0.04880098
## Run 7 stress 0.03624875
## Run 8 stress 0.0001140337
## ... Procrustes: rmse 3.113174e-05 max resid 0.0001891792
## ... Similar to previous best
## Run 9 stress 0.02513349
## Run 10 stress 8.170448e-05
## ... Procrustes: rmse 1.963389e-05 max resid 0.000189902
## ... Similar to previous best
## Run 11 stress 0.02545743
## Run 12 stress 0.0536762
## Run 13 stress 9.656149e-05
## ... Procrustes: rmse 2.356572e-05 max resid 0.0004593796
## ... Similar to previous best
## Run 14 stress 0.0001233483
## ... Procrustes: rmse 3.505467e-05 max resid 0.0002209537
## ... Similar to previous best
## Run 15 stress 9.361161e-05
## ... Procrustes: rmse 2.49622e-05 max resid 0.0001717053
## ... Similar to previous best
## Run 16 stress 0.02545695
## Run 17 stress 0.02643403
## Run 18 stress 8.312258e-05
## ... Procrustes: rmse 2.027429e-05 max resid 0.0002742561
## ... Similar to previous best
## Run 19 stress 0.02513339
## Run 20 stress 0.04298578
## *** Best solution repeated 9 times
## Warning in metaMDS(data_numeric, distance = "euclidean", k = 2, trymax = 100):
## stress is (nearly) zero: you may have insufficient data
## [1] 2.149938e-18
# Extract NMDS scores (coordinates in 2D space)
nmds_points <- as.data.frame(nmds_result$points)
nmds_points$admit <- as.factor(data$admit)
# Plot using ggplot
ggplot(nmds_points, aes(x = MDS1, y = MDS2, color = admit)) +
geom_point(size = 3, alpha = 0.7) +
labs(title = "NMDS of Grad School Admissions",
x = "NMDS Dimension 1", y = "NMDS Dimension 2",
color = "Admission Status") +
theme_minimal()
Similar scores clustering together would be good; two separate groups (admitted vs. non-admitted) would be even better. I did not get the latter, but I do think the plot shows two clusters, or at least the start of some clustering. One caution: metaMDS warned above that the stress was nearly zero, which with only three numeric variables suggests the solution may be degenerate, so this ordination should be read carefully.
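As a cross-check on the NMDS (given the near-zero stress warning), a plain PCA of the same three scaled variables gives a comparable 2-D picture without the ordination machinery; a sketch:

```r
# PCA on the scaled numeric predictors as a simpler ordination
pca <- prcomp(data[, c("gre", "gpa", "rank")], scale. = TRUE)
pca_points <- as.data.frame(pca$x)
pca_points$admit <- as.factor(data$admit)

ggplot(pca_points, aes(x = PC1, y = PC2, color = admit)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(title = "PCA of Grad School Admissions",
       x = "PC1", y = "PC2", color = "Admission Status") +
  theme_minimal()
```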